library(dplyr)
library(ggplot2)
library(shiny)
library(shinyWidgets)
library(gridExtra)
library(tidyverse)
library(readxl)
library(rpart)
library(rpart.plot)
library(ROSE)
library(knitr)
For classification, we want to categorize each YouTuber by
highest_yearly_earnings. We will obtain the GDP of each
country and create the target variable status by comparing
each YouTuber's earnings with the GDP of their country.
df <- read.csv('youtube_UTF_8.csv', na.strings = c('nan'))
df <- df[, -c(6, 9, 11:13, 15:17, 21:28)] # Remove redundant columns
highest_yearly_earnings is highly correlated with
lowest_monthly_earnings,
highest_monthly_earnings, and
lowest_yearly_earnings, so we will remove the latter three
from this analysis. Redundant columns such as Title and
Abbreviation will also be excluded. Lastly, we will remove the
columns created_month, created_date,
Gross.tertiary.education.enrollment....,
Population, Unemployment.rate,
Urban_population, Latitude, and
Longitude; the last six of these are country-level attributes
directly determined by Country, so we keep only
Country.
# Function to count NAs in each column
count_na_in_column <- function(data) {
sapply(data, FUN = function(col) sum(is.na(col)))}
count_na_in_column(df)
## rank Youtuber
## 0 0
## subscribers video.views
## 0 0
## category uploads
## 46 0
## Country channel_type
## 122 30
## video_views_for_the_last_30_days highest_yearly_earnings
## 56 0
## subscribers_for_last_30_days created_year
## 337 5
There are 122 missing entries in the Country variable,
and we need to impute them since we rely on GDP information keyed on
Country. Our chosen imputation method is to fill the
missing values with the mode of the Country variable.
# Extract the names of the countries
country_name <- df$Country %>%
table() %>%
names()
# Extract the name of the country with the maximum frequency
country_mode <- country_name[which.max(table(df$Country))]
# Impute missing country with mode
df <- df %>%
mutate(Country = ifelse(is.na(Country), country_mode, Country))
We impute subscribers_for_last_30_days and
video_views_for_the_last_30_days with the column median and record which rows were imputed.
replace_na_with_median <- function(data, columns) {
for (col in columns) {
x <- data[[col]]
non_na_values <- x[!is.na(x)]
median_value <- median(non_na_values)
is_bad <- ifelse(is.na(x), 1, 0) # isBAD takes value 1 where an NA is replaced by the median
data[[col]][is.na(data[[col]])] <- median_value
data[[paste0(col, "_isBAD")]] <- is_bad
}
return(data)
}
columns_to_replace <- c("subscribers_for_last_30_days", "video_views_for_the_last_30_days")
df <- replace_na_with_median(df, columns_to_replace)
We fill the remaining NAs in category and
channel_type by cross-mapping their corresponding levels.
replace_missing_with_mapping <- function(data, level_mapping) {
for (i in seq_len(nrow(level_mapping))) {
category_level <- level_mapping$category_level[i]
channel_level <- level_mapping$channel_level[i]
data$channel_type[is.na(data$channel_type) & data$category == category_level] <- channel_level
data$category[is.na(data$category) & data$channel_type == channel_level] <- category_level
}
return(data)
}
# Create mapping of similar level of category and channel_type
level_mapping <- data.frame(
category_level = c("Pets & Animals", "Autos & Vehicles", "Comedy", "Education", "Entertainment", "Film & Animation", "Gaming", "Howto & Style", "Music", "News & Politics", "Nonprofits & Activism", "People & Blogs", "Sports", "Science & Technology"),
channel_level = c("Animals", "Autos", "Comedy", "Education", "Entertainment", "Film", "Games", "Howto", "Music", "News", "Nonprofit", "People", "Sports", "Tech"))
df <- replace_missing_with_mapping(df, level_mapping)
count_na_in_column(df)
## rank Youtuber
## 0 0
## subscribers video.views
## 0 0
## category uploads
## 3 0
## Country channel_type
## 0 4
## video_views_for_the_last_30_days highest_yearly_earnings
## 0 0
## subscribers_for_last_30_days created_year
## 0 5
## subscribers_for_last_30_days_isBAD video_views_for_the_last_30_days_isBAD
## 0 0
df <- na.omit(df) # Omit the row with missing values
We load the GDP dataset, treating the string "no data" as NA.
gdp <- read_excel("GDP.xls", na = "no data")
gdp <- as.data.frame(gdp) # Convert tibble to dataframe
gdp <- gdp[-1, ] # Remove an empty first row
new_column_names <- c("Country", as.character(1980:2028))
gdp <- gdp %>%
rename_with(~new_column_names, .cols = everything()) # Rename columns
# Verify the consistency of country names between the YouTube data and GDP data
sum(country_name %in% gdp$Country) - length(country_name)
## [1] -5
country_name[which(!country_name %in% gdp$Country)]
## [1] "China" "Cuba" "Russia" "South Korea" "Turkey"
We found that 5 countries in the YouTube dataset do not appear in the GDP dataset. Four of them are present under different names and need renaming. However, the GDP dataset does not cover Cuba at all, so we need to find another data source for it.
gdp <- gdp %>%
mutate(Country = gsub("China, People's Republic of", "China", Country)) %>%
mutate(Country = gsub("Korea, Republic of", "South Korea", Country)) %>%
mutate(Country = gsub("Russian Federation", "Russia", Country)) %>%
mutate(Country = gsub("Türkiye, Republic of", "Turkey", Country))
gdp[gdp$Country %in% levels(as.factor(df$Country)), c("Country", "2020", "2021", "2022", "2023")]
## Country 2020 2021 2022 2023
## 2 Afghanistan 611.268 NA NA NA
## 5 Andorra 36973.845 41873.060 41931.030 44386.988
## 8 Argentina 8571.937 10616.947 13655.200 13709.490
## 11 Australia 53071.718 63896.297 65526.118 64964.282
## 16 Bangladesh 2270.348 2497.741 2730.846 2469.577
## 17 Barbados 16379.468 16864.318 19579.156 21085.944
## 26 Brazil 6970.719 7754.611 8995.029 9673.093
## 34 Canada 43383.713 52387.812 55085.452 52722.484
## 37 Chile 13067.742 16059.804 15094.829 17827.414
## 38 China 10525.001 12572.071 12813.765 13721.051
## 39 Colombia 5363.072 6239.274 6664.269 6417.047
## 52 Ecuador 5670.330 5978.915 6462.217 6642.700
## 53 Egypt 3802.438 4145.939 4563.298 3644.255
## 54 El Salvador 3903.414 4551.161 4987.903 5308.092
## 61 Finland 49168.162 53595.186 50655.133 54351.246
## 62 France 40385.395 45185.845 42409.045 44408.417
## 66 Germany 46735.314 51237.643 48636.030 51383.504
## 79 India 1913.220 2234.337 2379.209 2601.362
## 80 Indonesia 3932.332 4362.677 4798.119 5016.638
## 82 Iraq 4219.874 5015.489 6399.688 6180.517
## 85 Italy 31784.787 35842.069 34113.201 36812.317
## 87 Japan 40117.916 39882.554 33821.931 35385.073
## 88 Jordan 4336.382 4460.880 4740.941 5048.459
## 92 South Korea 31728.304 34997.985 32250.409 33393.074
## 94 Kuwait 22683.638 28884.300 38328.894 33646.139
## 97 Latvia 18123.655 20996.889 22347.970 25136.162
## 107 Malaysia 10361.276 11449.782 12364.058 13382.408
## 114 Mexico 8533.496 9869.076 10867.807 12673.634
## 119 Morocco 3375.265 3934.310 3764.720 3748.598
## 125 Netherlands 52222.364 57996.912 56489.068 61098.871
## 133 Pakistan 1376.512 1564.434 1658.363 NA
## 138 Peru 6145.031 6678.850 7094.448 7772.959
## 139 Philippines 3325.836 3576.101 3623.389 3905.483
## 145 Russia 10180.670 12617.864 15443.829 14403.573
## 150 Samoa 4371.767 4224.921 4140.002 4436.908
## 152 Saudi Arabia 20971.147 25463.654 31849.729 29922.089
## 157 Singapore 61274.006 77710.070 82807.649 91100.374
## 164 Spain 26943.767 30133.937 29420.615 31223.354
## 168 Sweden 52706.294 60929.619 55689.403 55395.444
## 169 Switzerland 85872.732 92238.988 92371.449 98767.334
## 175 Thailand 7170.849 7226.837 7650.884 8181.928
## 183 Turkey 8612.310 9654.085 10618.285 11931.502
## 185 Ukraine 3780.075 4882.111 4348.570 4653.892
## 186 United Arab Emirates 37648.963 43422.085 51305.686 49451.635
## 187 United Kingdom 40347.364 46421.611 45294.805 46371.452
## 188 United States 63577.341 70159.774 76348.494 80034.581
## 192 Venezuela 1566.623 2071.603 3459.154 3640.812
## 193 Vietnam 3548.892 3753.428 4086.519 4475.507
46 of the 48 countries have the latest (2023) GDP figures; Pakistan and Afghanistan do not. Therefore, we will use the most recent available GDP figures for those two: 2022 for Pakistan and 2020 for Afghanistan.
gdp2023 <- gdp[, c(1, 45)]
gdp2022 <- subset(gdp, Country == "Pakistan", select = c("Country", "2022"))
gdp2020 <- subset(gdp, Country == "Afghanistan", select = c("Country", "2020"))
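The per-country fallback above (2022 for Pakistan, 2020 for Afghanistan) can be generalized: scan the year columns from newest to oldest and keep the first non-missing value per row. A base-R sketch on toy data (the toy data frame and values here are illustrative, not the actual GDP table):

```r
# Toy stand-in for the GDP table; column names mimic the year columns.
gdp_toy <- data.frame(Country = c("Afghanistan", "Pakistan", "Peru"),
                      `2020` = c(611.268, 1376.512, 6145.031),
                      `2022` = c(NA, 1658.363, 7094.448),
                      `2023` = c(NA, NA, 7772.959),
                      check.names = FALSE)
# For each row, take the first non-NA value scanning newest-first.
latest_gdp <- apply(gdp_toy[, c("2023", "2022", "2020")], 1,
                    function(r) r[!is.na(r)][1])
# Afghanistan -> 611.268 (2020); Pakistan -> 1658.363 (2022); Peru -> 7772.959 (2023)
```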
We joined the GDP dataset to the YouTube dataset using
Country as the key. We used a left_join()
because there is still missing GDP data for Cuba.
df_gdp <- left_join(df, gdp2023, by = "Country")
df_gdp <- df_gdp %>%
rename(GDP = `2023`) # Rename GDP column
df_gdp$GDP <- ifelse(df_gdp$Country == "Afghanistan", gdp2020[1, 2],
ifelse(df_gdp$Country == "Pakistan", gdp2022[1, 2], df_gdp$GDP))
worldbank <- read_excel("gdp_worldbank.xls")
worldbank <- worldbank[-c(1:2), ] %>%
setNames(.[1, ]) # Make the first row as column names
worldbank <- worldbank[-1, ] %>%
as.data.frame()
We used another data source for the GDP of Cuba, called “gdp_worldbank.xls”. The latest available GDP for Cuba is from 2020. We have now completed the process of finding GDP for every country in the YouTube dataset.
subset(worldbank, `Country Name` == "Cuba", select = c(`2020`, `2021`, `2022`))
## 2020 2021 2022
## 51 9499.5902023043182 <NA> <NA>
cuba_gdp <- subset(worldbank, `Country Name` == "Cuba", select = c(`Country Name`, `2020`))
df_gdp$GDP <- ifelse(df_gdp$Country == "Cuba", cuba_gdp[1, 2], df_gdp$GDP)
sum(is.na(df_gdp$GDP)) # All GDP values are now filled in
## [1] 0
rm(gdp2020, gdp2022, gdp2023, cuba_gdp, gdp, worldbank) # Remove unused dataframes
df_gdp <- df_gdp %>%
mutate(GDP = as.numeric(GDP)) %>%
mutate(created_year = as.factor(created_year)) %>%
mutate(subscribers_for_last_30_days_isBAD = as.factor(subscribers_for_last_30_days_isBAD)) %>%
mutate(video_views_for_the_last_30_days_isBAD = as.factor(video_views_for_the_last_30_days_isBAD))
We now create the target variable status:
df_gdp$status <- ifelse(df_gdp$GDP < df_gdp$highest_yearly_earnings, 1, 0)
table(df_gdp$status)
##
## 0 1
## 169 817
The target variable status is created, with 1 indicating
high income earning and 0 indicating low income earning. There are 817
high income-earning YouTubers against 169 low income-earning ones. This
class distribution is imbalanced; therefore, we will use the ROSE
package, which generates synthetic data by Random Over-Sampling
Examples, to balance the status variable.
df_gdp.rose <- df_gdp[, -c(1:2, 8, 10, 15, 17)]
# Map the Country column to numeric codes
country_mapping <- c(
"Afghanistan" = 1, "Andorra" = 2, "Argentina" = 3, "Australia" = 4,
"Bangladesh" = 5, "Barbados" = 6, "Brazil" = 7, "Canada" = 8,
"Chile" = 9, "China" = 10, "Colombia" = 11, "Cuba" = 12,
"Ecuador" = 13, "Egypt" = 14, "El Salvador" = 15, "Finland" = 16,
"France" = 17, "Germany" = 18, "India" = 19, "Indonesia" = 20,
"Iraq" = 21, "Italy" = 22, "Japan" = 23, "Jordan" = 24,
"Kuwait" = 25, "Latvia" = 26, "Malaysia" = 27, "Mexico" = 28,
"Morocco" = 29, "Netherlands" = 30, "Pakistan" = 31, "Peru" = 32,
"Philippines" = 33, "Russia" = 34, "Samoa" = 35, "Saudi Arabia" = 36,
"Singapore" = 37, "South Korea" = 38, "Spain" = 39, "Sweden" = 40,
"Switzerland" = 41, "Thailand" = 42, "Turkey" = 43, "Ukraine" = 44,
"United Arab Emirates" = 45, "United Kingdom" = 46, "United States" = 47,
"Venezuela" = 48, "Vietnam" = 49)
df_gdp.rose$Country <- country_mapping[df_gdp.rose$Country]
df_gdp.rose$Country <- as.factor(df_gdp.rose$Country)
# Map the category column to numeric codes
category_mapping <- c(
"Autos & Vehicles" = 1, "Comedy" = 2, "Education" = 3, "Entertainment" = 4,
"Film & Animation" = 5, "Gaming" = 6, "Howto & Style" = 7, "Movies" = 8,
"Music" = 9, "News & Politics" = 10, "Nonprofits & Activism" = 11, "People & Blogs" = 12,
"Pets & Animals" = 13, "Science & Technology" = 14, "Shows" = 15, "Sports" = 16,
"Trailers" = 17, "Travel & Events" = 18)
df_gdp.rose$category <- category_mapping[df_gdp.rose$category]
df_gdp.rose$category <- as.factor(df_gdp.rose$category)
df.rose <- ROSE(status ~ ., data = df_gdp.rose, seed=123)$data
table(df.rose$status)
##
## 0 1
## 488 498
df.rose$status <- as.factor(df.rose$status)
ROSE handles only continuous and categorical variables, so we
recoded Country and category as numeric level
codes. After applying ROSE, we obtained a balanced dataset.
# a 90/10 split to form the training and test sets
set.seed(729375)
df.rose$rgroup <- runif(dim(df.rose)[1])
dTrainAll <- subset(df.rose, rgroup<=0.9)
dTest <- subset(df.rose, rgroup>0.9)
outcome <- c('status')
pos <- '1'
# names of columns that are categorical type and numerical type
vars <- colnames(df.rose)[!(colnames(df.rose) %in% c("status", "rgroup"))]
catVars <- vars[sapply(dTrainAll[, vars], class) %in% c('factor', 'character')]
numericVars <- vars[sapply(dTrainAll[, vars], class) %in% c('numeric', 'integer')]
# split dTrainAll into a training set and a validation (or calibration) set
useForCal <- rbinom(n=dim(dTrainAll)[1], size=1, prob=0.1)>0
dCal <- subset(dTrainAll, useForCal)
dTrain <- subset(dTrainAll, !useForCal)
cat("Country list in dTrain:")
## Country list in dTrain:
table(dTrain$Country)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 4 2 11 11 3 2 39 12 9 1 4 6 2 2 0 0 2 5 116 16
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 0 5 0 0 0 0 30 1 2 5 0 6 12 2 3 6 9 22 0
## 41 42 43 44 45 46 47 48 49
## 0 13 3 5 3 37 403 1 0
cat("Country list in dTest:")
## Country list in dTest:
table(dTest$Country)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 0 1 1 2 0 0 3 0 0 0 0 1 0 0 0 0 1 0 10 3 0 0 1 0 0 0
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 1 2 0 0 2 0 1 0 0 0 1 1 6 0 0 2 0 0 0 3 55 0 0
dTest <- dTest %>% filter(Country != 27)
Country level 27 (Malaysia) appears in the test data but not in the training data. This mismatch could cause issues when making predictions on the test set, so we exclude that observation from the test data.
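The exclusion above can be generalized into a reusable guard (a sketch, not part of the original analysis; the helper name drop_unseen_levels is ours) that drops any test rows whose factor levels never occur in the training data:

```r
# Keep only test rows whose value in `col` also appears in training.
drop_unseen_levels <- function(train, test, col) {
  seen <- unique(as.character(train[[col]]))
  test[as.character(test[[col]]) %in% seen, , drop = FALSE]
}

# Toy example: level "27" is unseen in training and gets dropped.
train_toy <- data.frame(Country = factor(c("19", "47", "19")))
test_toy  <- data.frame(Country = factor(c("47", "27")))
kept <- drop_unseen_levels(train_toy, test_toy, "Country")  # keeps only the "47" row
```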
# Function to calculate predictions for categorical variables
mkPredC <- function(outCol, varCol, appCol) {
pPos <- sum(outCol == pos) / length(outCol)
naTab <- table(as.factor(outCol[is.na(varCol)]))
pPosWna <- (naTab/sum(naTab))[pos]
vTab <- table(as.factor(outCol), varCol)
pPosWv <- (vTab[pos, ] + 1.0e-3*pPos) / (colSums(vTab) + 1.0e-3)
pred <- pPosWv[appCol]
pred[is.na(appCol)] <- pPosWna
pred[is.na(pred)] <- pPos
pred
}
for(v in catVars) {
pi <- paste('pred', v, sep='')
dTrain[,pi] <- mkPredC(dTrain[,outcome], dTrain[,v], dTrain[,v])
dCal[,pi] <- mkPredC(dTrain[,outcome], dTrain[,v], dCal[,v])
dTest[,pi] <- mkPredC(dTrain[,outcome], dTrain[,v], dTest[,v])
}
# Function to evaluate prediction through AUC score
library('ROCR')
calcAUC <- function(predcol,outcol) {
perf <- performance(prediction(predcol,outcol==pos),'auc')
as.numeric(perf@y.values)
}
for(v in catVars) {
pi <- paste('pred', v, sep='')
aucTrain <- calcAUC(dTrain[,pi], dTrain[,outcome])
if (aucTrain >= 0.5) {
aucCal <- calcAUC(dCal[,pi], dCal[,outcome])
print(sprintf("%s: trainAUC: %4.3f; calibrationAUC: %4.3f", pi, aucTrain, aucCal))
}
}
## [1] "predcategory: trainAUC: 0.589; calibrationAUC: 0.626"
## [1] "predCountry: trainAUC: 0.734; calibrationAUC: 0.676"
## [1] "predcreated_year: trainAUC: 0.695; calibrationAUC: 0.700"
## [1] "predsubscribers_for_last_30_days_isBAD: trainAUC: 0.639; calibrationAUC: 0.709"
## [1] "predvideo_views_for_the_last_30_days_isBAD: trainAUC: 0.632; calibrationAUC: 0.640"
The AUC results on the training and calibration dataset represent the
ability of a model to distinguish between the positive and negative
classes. An AUC of 0.5 indicates that the model has no discriminatory
power and is equivalent to random guessing. All the categorical
predictors have AUCs greater than 0.5, meaning all of them have some
level of discriminatory power. The predictor with the best
discriminatory power on the training data is predCountry,
with an AUC of 0.734, while the predictor
predsubscribers_for_last_30_days_isBAD has the best
performance on the calibration (unseen) data, with an AUC of 0.709.
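As a concrete illustration (a base-R sketch, not the ROCR code used here), AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one, and can be computed from ranks:

```r
# Rank-based AUC (Mann-Whitney statistic); ties receive averaged ranks.
auc_by_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_by_rank(c(0.9, 0.8, 0.3, 0.2), c(1, 1, 0, 0))  # perfect ranking: 1
auc_by_rank(c(0.2, 0.9, 0.3, 0.8), c(1, 1, 0, 0))  # 0.5, i.e. random guessing
```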
# 100-fold cross-validation
for (v in catVars) {
aucs <- rep(0,100)
for (rep in 1:length(aucs)) {
useForCalRep <- rbinom(n=nrow(dTrainAll), size=1, prob=0.1) > 0
predRep <- mkPredC(dTrainAll[!useForCalRep, outcome],
dTrainAll[!useForCalRep, v],
dTrainAll[useForCalRep, v])
aucs[rep] <- calcAUC(predRep, dTrainAll[useForCalRep, outcome])
}
print(sprintf("%s: mean: %4.3f; sd: %4.3f", v, mean(aucs), sd(aucs)))
}
## [1] "category: mean: 0.558; sd: 0.053"
## [1] "Country: mean: 0.700; sd: 0.051"
## [1] "created_year: mean: 0.673; sd: 0.054"
## [1] "subscribers_for_last_30_days_isBAD: mean: 0.646; sd: 0.048"
## [1] "video_views_for_the_last_30_days_isBAD: mean: 0.637; sd: 0.035"
To obtain a more robust performance metric, we apply 100-fold
cross-validation: the dataset is split repeatedly so we can assess how
each predictor performs on many different portions of the data and get
a better sense of likely performance on unseen data. From the results
above, the predictor Country shows the strongest
discriminatory power, with a mean AUC of 0.70. Overall, all predictors
performed consistently, as evidenced by the relatively small standard
deviations.
# Double density plot
str(factor(dTrain[,"category"]))
## Factor w/ 17 levels "1","2","3","4",..: 2 9 12 6 4 9 5 4 6 4 ...
str(factor(dTrain[,"Country"]))
## Factor w/ 38 levels "1","2","3","4",..: 17 7 37 36 17 17 37 17 17 37 ...
str(factor(dTrain[,"created_year"]))
## Factor w/ 18 levels "2005","2006",..: 16 10 13 11 16 2 4 10 13 14 ...
fig1 <- ggplot(dCal) + geom_density(aes(x=predcategory, color=as.factor(status)))
fig2 <- ggplot(dCal) + geom_density(aes(x=predCountry, color=as.factor(status)))
fig3 <- ggplot(dCal) + geom_density(aes(x=predcreated_year, color=as.factor(status)))
grid.arrange(fig1, fig2, fig3, ncol=2)
To visualize the distribution of the predicted values for the two
classes, we created a density plot for each variable. The sharp peak in
the blue curve of the predcreated_year vs. status plot
signifies a dominant presence of high earners relative to low earners,
especially where the predicted value is around 0.5. In contrast, both
high and low earners are concentrated around 0.4-0.5 in the
predcategory vs. status and predCountry vs. status
plots, as evidenced by the sharp peaks at those points. However, these
two predictors show substantial overlap, which makes it harder to
differentiate between the positive and negative classes.
# ROC Curve
library(ROCit)
plot_roc <- function(predcol, outcol, colour_id=2, overlaid=F) {
ROCit_obj <- rocit(score=predcol, class=outcol==pos)
par(new=overlaid)
plot(ROCit_obj, col = c(colour_id, 1),
legend = FALSE, YIndex = FALSE, values = FALSE)
}
plot_roc(dTest$predcategory, dTest[,outcome]) # red
plot_roc(dTest$predCountry, dTest[,outcome], colour_id=3, overlaid=T) # green
plot_roc(dTest$predcreated_year, dTest[,outcome], colour_id=5, overlaid=T) # sky blue
The ROC plot above visualizes the trade-off between the true
positive rate and the false positive rate of each classifier. The
predcreated_year model has little to no ability to
discriminate between the positive and negative classes, as it closely
follows the dashed line, which represents no discriminatory power. On
the other hand, the model represented by the green curve,
predCountry, has good discriminative ability, particularly
at thresholds where the curve is steepest. It is considerably better
than random guessing, as shown by its substantial departure from the
diagonal line, meaning it achieves a higher true positive rate at a
given false positive rate than a random classifier.
# Function to calculate predictions for numerical variables
mkPredN <- function(outCol, varCol, appCol) {
# compute the cuts
cuts <- unique(
quantile(varCol, probs=seq(0, 1, 0.1), na.rm=T))
# discretize the numerical columns
varC <- cut(varCol,cuts)
appC <- cut(appCol,cuts)
mkPredC(outCol,varC,appC)
}
for (v in numericVars) {
pi <- paste('pred', v, sep='')
dTrain[,pi] <- mkPredN(dTrain[,outcome], dTrain[,v], dTrain[,v])
dCal[,pi] <- mkPredN(dTrain[,outcome], dTrain[,v], dCal[,v])
dTest[,pi] <- mkPredN(dTrain[,outcome], dTrain[,v], dTest[,v])
aucTrain <- calcAUC(dTrain[,pi], dTrain[,outcome])
if(aucTrain >= 0.55) {
aucCal <- calcAUC(dCal[,pi], dCal[,outcome])
print(sprintf("%s: trainAUC: %4.3f; calibrationAUC: %4.3f", pi, aucTrain, aucCal))
}
}
## [1] "predsubscribers: trainAUC: 0.584; calibrationAUC: 0.571"
## [1] "predvideo.views: trainAUC: 0.591; calibrationAUC: 0.572"
## [1] "preduploads: trainAUC: 0.843; calibrationAUC: 0.815"
## [1] "predvideo_views_for_the_last_30_days: trainAUC: 0.728; calibrationAUC: 0.718"
## [1] "predsubscribers_for_last_30_days: trainAUC: 0.777; calibrationAUC: 0.789"
All the numerical predictors have AUCs greater than 0.5, meaning all
of them have some level of discriminatory power. The predictor with the
strongest discriminatory power on the training data is
preduploads, with an AUC of 0.843. Notably, its performance
on the calibration data also stands out, with an AUC of 0.815. Overall,
this predictor consistently demonstrates robust discriminatory power.
for (v in numericVars) {
aucs <- rep(0,100)
for (rep in 1:length(aucs)) {
useForCalRep <- rbinom(n=nrow(dTrainAll), size=1, prob=0.1) > 0
predRep <- mkPredN(dTrainAll[!useForCalRep, outcome],
dTrainAll[!useForCalRep, v],
dTrainAll[useForCalRep, v])
aucs[rep] <- calcAUC(predRep, dTrainAll[useForCalRep, outcome])
}
print(sprintf("%s: mean: %4.3f; sd: %4.3f", v, mean(aucs), sd(aucs)))
}
## [1] "subscribers: mean: 0.555; sd: 0.057"
## [1] "video.views: mean: 0.572; sd: 0.062"
## [1] "uploads: mean: 0.831; sd: 0.044"
## [1] "video_views_for_the_last_30_days: mean: 0.702; sd: 0.060"
## [1] "subscribers_for_last_30_days: mean: 0.776; sd: 0.044"
After conducting a 100-fold cross-validation on the numerical
predictors, uploads demonstrates the strongest ability to
differentiate between classes, with a mean AUC of 0.831 and a small
standard deviation of 0.044, while all predictors show reasonably
consistent performance across the 100 cross-validation folds.
fig1 <- ggplot(dCal) + geom_density(aes(x=predsubscribers, color=as.factor(status)))
fig2 <- ggplot(dCal) + geom_density(aes(x=predvideo.views, color=as.factor(status)))
fig3 <- ggplot(dCal) + geom_density(aes(x=preduploads, color=as.factor(status)))
fig4 <- ggplot(dCal) + geom_density(aes(x=predvideo_views_for_the_last_30_days, color=as.factor(status)))
fig5 <- ggplot(dCal) + geom_density(aes(x=predsubscribers_for_last_30_days, color=as.factor(status)))
grid.arrange(fig1, fig2, fig3, fig4, fig5, ncol=2)
The plots for predsubscribers and
predvideo.views exhibit overlapping curves, which
highlights the difficulty of differentiating between high- and
low-earning channels based solely on these individual variables. This
overlap demonstrates the importance of considering additional variables
for a more accurate classification. In contrast, the
preduploads graph distinctly underscores its efficacy in
discerning between the two classes: with its blue curve peaking around
the 0.8 mark, it suggests that channels with high predicted upload
counts often correlate with high earnings. In the
predvideo_views_for_the_last_30_days plot, a peak in the
blue curve around the 0.6 mark demonstrates a clear association between
channels with many predicted video views in the last 30 days and higher
earnings. Likewise, the
predsubscribers_for_last_30_days graph shows that channels
with a notable increase in subscribers, evidenced by the peak around
the 0.8 mark, tend to have high earnings.
calcAUC(dTrain[,"predsubscribers"], dTrain[,outcome])
calcAUC(dCal[,"predvideo.views"], dCal[,outcome])
calcAUC(dCal[,"preduploads"], dCal[,outcome])
calcAUC(dCal[,"predvideo_views_for_the_last_30_days"], dCal[,outcome])
calcAUC(dCal[,"predsubscribers_for_last_30_days"], dCal[,outcome])
## [1] 0.5839665
## [1] 0.5717054
## [1] 0.8147287
## [1] 0.7182171
## [1] 0.7887597
plot_roc(dTest$predsubscribers, dTest[,outcome]) #red
plot_roc(dTest$predvideo.views, dTest[,outcome], colour_id=3, overlaid=T) #green
plot_roc(dTest$preduploads, dTest[,outcome], colour_id=4, overlaid=T) #blue
plot_roc(dTest$predvideo_views_for_the_last_30_days, dTest[,outcome], colour_id=5, overlaid=T) #skyblue
plot_roc(dTest$predsubscribers_for_last_30_days, dTest[,outcome], colour_id=6, overlaid=T) #purple
df.roc <- dTest
Based on the ROC plot for the numerical variables, it is evident that
each variable possesses a distinct discriminatory capability. The
predictor predvideo.views, represented by the green curve,
at times follows the dashed line or even dips slightly below it. In
contrast, both preduploads and
predsubscribers_for_last_30_days are standout predictors.
Their curves lean prominently towards the upper left of the plot,
signifying performance far better than random guessing. This pronounced
deviation from the dashed line reflects the higher true positive rate
of these two predictors compared to a random classifier.
We employed two techniques for feature selection: the Boruta package and the Fisher score. As the Fisher score requires numeric features, we converted all features to numeric format and applied both techniques to ensure consistency.
df.rose <- df.rose %>%
mutate(subscribers_for_last_30_days_isBAD = as.numeric(subscribers_for_last_30_days_isBAD)) %>%
mutate(video_views_for_the_last_30_days_isBAD = as.numeric(video_views_for_the_last_30_days_isBAD)) %>%
mutate(created_year = as.numeric(created_year)) %>%
mutate(Country = as.numeric(Country)) %>%
mutate(category = as.numeric(category))
To ensure that the training, calibration, and test sets remain consistent with the conversion of features to numeric format, we regenerated the split from the updated data.
# a 90/10 split to form the training and test sets.
set.seed(729375)
df.rose$rgroup <- runif(dim(df.rose)[1])
dTrainAll <- subset(df.rose, rgroup<=0.9)
dTest <- subset(df.rose, rgroup>0.9)
outcome <- c('status')
pos <- "1"
# names of columns that are categorical type and numerical type
vars <- colnames(df.rose)[!(colnames(df.rose) %in% c("status", "rgroup"))]
catVars <- vars[sapply(dTrainAll[, vars], class) %in% c('factor', 'character')]
numericVars <- vars[sapply(dTrainAll[, vars], class) %in% c('numeric', 'integer')]
# split dTrainAll into a training set and a validation (or calibration) set
useForCal <- rbinom(n=dim(dTrainAll)[1], size=1, prob=0.1)>0
dCal <- subset(dTrainAll, useForCal)
dTrain <- subset(dTrainAll, !useForCal)
library(Boruta)
selected_features <- df.rose[, c("status", vars), drop = FALSE]
boruta_result <- Boruta(status ~ ., data = selected_features, doTrace = 0)
roughFixMod <- TentativeRoughFix(boruta_result)
## Warning in TentativeRoughFix(boruta_result): There are no Tentative attributes!
## Returning original object.
boruta_signif <- getSelectedAttributes(roughFixMod)
attStats(boruta_result)
## meanImp medianImp minImp maxImp
## subscribers 6.349151 6.414727 4.45378 7.938653
## video.views 7.748980 8.309504 5.33165 9.070138
## category 10.404167 10.210467 9.53521 11.594118
## uploads 44.667995 44.902370 41.13901 46.418478
## Country 12.799833 12.539001 11.91663 13.988078
## video_views_for_the_last_30_days 34.649011 34.721749 31.81485 36.659204
## subscribers_for_last_30_days 51.156220 51.085080 48.76587 54.037692
## created_year 16.262215 16.043087 15.24773 18.496473
## subscribers_for_last_30_days_isBAD 19.285774 19.190617 17.99441 20.706737
## video_views_for_the_last_30_days_isBAD 32.567822 32.938428 30.74729 33.847957
## normHits decision
## subscribers 1 Confirmed
## video.views 1 Confirmed
## category 1 Confirmed
## uploads 1 Confirmed
## Country 1 Confirmed
## video_views_for_the_last_30_days 1 Confirmed
## subscribers_for_last_30_days 1 Confirmed
## created_year 1 Confirmed
## subscribers_for_last_30_days_isBAD 1 Confirmed
## video_views_for_the_last_30_days_isBAD 1 Confirmed
All the features listed in the table have been marked as “Confirmed”,
which indicates that all of the predictors are significant for
predicting the status outcome.
plot(boruta_result, cex.axis=.7, las = 2, xlab = "", main = "Variable Importance")
We decided to use the top seven predictors generated by the Boruta
technique for the first combination of attributes, excluding the
predictors category, video.views, and
subscribers.
# Fisher score
numericVars <- vars[sapply(df.rose[, vars], class) %in% c('numeric', 'integer')]
selected_numfeatures <- df.rose[, numericVars, drop = FALSE]
target_variable <- as.factor(df.rose$status)
# Function to calculate Fisher's Score for a single feature
fisher_score <- function(feature, target) {
means <- tapply(feature, target, mean)
variances <- tapply(feature, target, var)
between_class_variance <- sum(table(target) * (means - mean(feature))^2) / (length(target) - 1)
within_class_variance <- sum((table(target) - 1) * variances) / (length(target) - length(unique(target)))
fisher_score <- between_class_variance / within_class_variance
return(fisher_score)
}
# Calculate Fisher's Scores for all features
fisher_scores <- sapply(selected_numfeatures, function(feature) fisher_score(feature, target_variable))
kable(fisher_scores)
| Feature | Fisher score |
|---|---|
| subscribers | 0.0003266 |
| video.views | 0.0076178 |
| category | 0.0012865 |
| uploads | 0.0219873 |
| Country | 0.0311487 |
| video_views_for_the_last_30_days | 0.0019176 |
| subscribers_for_last_30_days | 0.0441089 |
| created_year | 0.0001115 |
| subscribers_for_last_30_days_isBAD | 0.0996885 |
| video_views_for_the_last_30_days_isBAD | 0.1793985 |
The Fisher scores measure how well each numerical feature can
discriminate between high earners and low earners. The predictors
video_views_for_the_last_30_days_isBAD,
subscribers_for_last_30_days_isBAD, and
subscribers_for_last_30_days show relatively higher
discriminatory power than the other predictors.
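As a quick sanity check on what the score captures (toy data, not from the analysis; fs() restates the helper above up to a constant normalization factor, which does not change the ranking of features):

```r
# Simplified Fisher ratio: between-class over within-class variation.
fs <- function(x, y) {
  m <- tapply(x, y, mean)   # class means
  v <- tapply(x, y, var)    # class variances
  n <- table(y)             # class sizes
  sum(n * (m - mean(x))^2) / sum((n - 1) * v)
}
set.seed(1)
y <- factor(rep(0:1, each = 50))
separated <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))  # class means far apart
noise <- rnorm(100)                                       # no class structure
fs(separated, y) > fs(noise, y)  # TRUE: separation yields a higher score
```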
fisher_data <- data.frame(Feature = colnames(selected_numfeatures), Score = fisher_scores)
ggplot(fisher_data, aes(x = Feature, y = Score)) +
geom_bar(stat = "identity", fill = "gray") +
labs(title = "Fisher's Scores", x = "Features", y = "Scores") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We have opted to retain the same features as in the initial attribute
combination, excluding created_year and substituting it
with video.views.
Based on the results of the feature selection process, we have two sets of attributes as follows. We will use decision trees on these two sets.
boruta.vars <- colnames(df.rose)[(colnames(df.rose) %in% c("video_views_for_the_last_30_days_isBAD", "video_views_for_the_last_30_days", "subscribers_for_last_30_days", "subscribers_for_last_30_days_isBAD", "uploads", "Country", "created_year"))]
# boruta.vars <- colnames(df.rose)[(colnames(df.rose) %in% c("subscribers_for_last_30_days", "uploads"))]
boruta.formula <- paste('as.factor(status) ~ ',
paste(c(boruta.vars), collapse=' + '),
sep='')
dt.boruta <- rpart(formula = boruta.formula, data = dTrain)
fisher.vars <- colnames(df.rose)[(colnames(df.rose) %in% c("video_views_for_the_last_30_days_isBAD", "subscribers_for_last_30_days_isBAD", "video_views_for_the_last_30_days", "subscribers_for_last_30_days", "uploads", "Country", "video.views"))]
fisher.formula <- paste('as.factor(status) ~ ',
paste(c(fisher.vars), collapse=' + '),
sep='')
dt.fisher <- rpart(formula = fisher.formula, data = dTrain)
rpart.plot(dt.boruta)
rpart.plot(dt.fisher)
We can observe that the decision tree plots are identical for these
two sets of attributes. However, the variable importance differs between
the two sets. In the Boruta feature set, the important variables are
uploads, subscribers_for_last_30_days, and
video_views_for_the_last_30_days, with importance values of
53, 43, and 4, respectively. On the other hand, in the Fisher feature
set, the important variables are uploads,
subscribers_for_last_30_days,
video_views_for_the_last_30_days, and
video.views, with importance values of 52, 42, 4, and 2,
respectively.
predicted_train <- predict(dt.boruta, newdata=dTrain[boruta.vars], type = "class")
predicted_test <- predict(dt.boruta, newdata=dTest[boruta.vars], type = "class")
predicted_cal <- predict(dt.boruta, newdata=dCal[boruta.vars], type = "class")
calculate_accuracy <- function(T1) {
diagonal_sum <- sum(diag(T1))
total_sum <- sum(T1)
accuracy <- diagonal_sum/total_sum
return(accuracy)
}
table(dTrain$status, predicted_train)
## predicted_train
## 0 1
## 0 358 35
## 1 65 358
table(dTest$status, predicted_test)
## predicted_test
## 0 1
## 0 49 3
## 1 4 41
table(dCal$status, predicted_cal)
## predicted_cal
## 0 1
## 0 38 5
## 1 6 24
calculate_accuracy(table(dTrain$status, predicted_train))
## [1] 0.877451
calculate_accuracy(table(dTest$status, predicted_test))
## [1] 0.9278351
calculate_accuracy(table(dCal$status, predicted_cal))
## [1] 0.8493151
predicted_train <- predict(dt.fisher, newdata=dTrain[fisher.vars], type = "class")
predicted_test <- predict(dt.fisher, newdata=dTest[fisher.vars], type = "class")
predicted_cal <- predict(dt.fisher, newdata=dCal[fisher.vars], type = "class")
calculate_accuracy <- function(T1) {
diagonal_sum <- sum(diag(T1))
total_sum <- sum(T1)
accuracy <- diagonal_sum / total_sum
return(accuracy)
}
table(dTrain$status, predicted_train)
## predicted_train
## 0 1
## 0 358 35
## 1 65 358
table(dTest$status, predicted_test)
## predicted_test
## 0 1
## 0 49 3
## 1 4 41
table(dCal$status, predicted_cal)
## predicted_cal
## 0 1
## 0 38 5
## 1 6 24
calculate_accuracy(table(dTrain$status, predicted_train))
## [1] 0.877451
calculate_accuracy(table(dTest$status, predicted_test))
## [1] 0.9278351
calculate_accuracy(table(dCal$status, predicted_cal))
## [1] 0.8493151
As anticipated, the confusion matrices and accuracies are identical for these two sets of attributes. The model achieves an accuracy of approximately 93% on the test set, which is notably high.
# Boruta-selected model
response <- outcome
features <- boruta.vars
formula.boruta <- as.formula(paste(response, paste(features, collapse=" + "), sep=" ~ "))
model.boruta <- glm(formula.boruta, data=dTrain, family=binomial(link="logit"))
summary(model.boruta)
##
## Call:
## glm(formula = formula.boruta, family = binomial(link = "logit"),
## data = dTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1640 -1.0309 0.1951 0.9132 1.9855
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.112e+01 6.087e+02 0.035 0.9723
## uploads 9.071e-06 3.929e-06 2.309 0.0210
## Country -2.486e-02 5.611e-03 -4.430 9.42e-06
## video_views_for_the_last_30_days -3.051e-10 1.393e-10 -2.191 0.0285
## subscribers_for_last_30_days 1.943e-06 3.599e-07 5.397 6.76e-08
## created_year -2.614e-02 1.863e-02 -1.403 0.1606
## subscribers_for_last_30_days_isBAD -9.422e-01 1.701e-01 -5.538 3.06e-08
## video_views_for_the_last_30_days_isBAD -1.864e+01 6.087e+02 -0.031 0.9756
##
## (Intercept)
## uploads *
## Country ***
## video_views_for_the_last_30_days *
## subscribers_for_last_30_days ***
## created_year
## subscribers_for_last_30_days_isBAD ***
## video_views_for_the_last_30_days_isBAD
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1130.11 on 815 degrees of freedom
## Residual deviance: 846.03 on 808 degrees of freedom
## AIC: 862.03
##
## Number of Fisher Scoring iterations: 17
dTrain$logpred1 <- predict(model.boruta, newdata=dTrain, type="response")
dTest$logpred1 <- predict(model.boruta, newdata=dTest, type="response")
# Fisher's Score-selected model
response <- outcome
features <- fisher.vars
formula.fisher <- as.formula(paste(response, paste(features, collapse=" + "), sep=" ~ "))
model.fisher <- glm(formula.fisher, data=dTrain, family="binomial")
summary(model.fisher)
##
## Call:
## glm(formula = formula.fisher, family = "binomial", data = dTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1687 -1.0238 0.1864 0.9101 1.9784
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.074e+01 6.118e+02 0.034 0.9730
## video.views 4.290e-12 7.160e-12 0.599 0.5490
## uploads 9.299e-06 3.924e-06 2.370 0.0178
## Country -2.473e-02 5.613e-03 -4.406 1.05e-05
## video_views_for_the_last_30_days -3.027e-10 1.392e-10 -2.175 0.0296
## subscribers_for_last_30_days 1.887e-06 3.616e-07 5.220 1.79e-07
## subscribers_for_last_30_days_isBAD -9.241e-01 1.701e-01 -5.432 5.57e-08
## video_views_for_the_last_30_days_isBAD -1.858e+01 6.118e+02 -0.030 0.9758
##
## (Intercept)
## video.views
## uploads *
## Country ***
## video_views_for_the_last_30_days *
## subscribers_for_last_30_days ***
## subscribers_for_last_30_days_isBAD ***
## video_views_for_the_last_30_days_isBAD
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1130.11 on 815 degrees of freedom
## Residual deviance: 847.64 on 808 degrees of freedom
## AIC: 863.64
##
## Number of Fisher Scoring iterations: 17
dTrain$logpred2 <- predict(model.fisher, newdata=dTrain, type="response")
dTest$logpred2 <- predict(model.fisher, newdata=dTest, type="response")
Based on the summaries of both logistic regression models, they share similar coefficients and significance levels for the common predictor variables. Two predictors, uploads and subscribers_for_last_30_days, are positively associated with the positive class, while subscribers_for_last_30_days_isBAD is strongly associated with the negative class, signifying its strong predictive power for that class. The predictor Country is also statistically significant; since it is a numeric encoding of country names, its negative coefficient indicates that the likelihood of belonging to the positive class differs across countries rather than implying any ordered effect. Furthermore, the Boruta-selected model and the Fisher's Score-selected model exhibit minimal differences in residual deviance and AIC. This similarity in performance metrics suggests that both models have a comparable level of complexity and perform similarly well in predicting the outcome.
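Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to interpret. A minimal sketch, assuming model.boruta has been fitted as above:

```r
# Odds ratios with Wald 95% confidence intervals for the Boruta model
# (sketch; assumes model.boruta fitted in the chunk above)
odds_ratios <- exp(coef(model.boruta))
ci <- exp(confint.default(model.boruta))  # Wald intervals, cheaper than profile CIs
round(cbind(OR = odds_ratios, ci), 4)
```

An odds ratio above 1 corresponds to a positive coefficient (higher odds of the positive class per unit increase), below 1 to a negative one.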
In this section, we define and describe several functions used for performance evaluation of logistic regression models. These functions help us assess the accuracy, precision, recall, F1 score, and AUC of our models to evaluate the effectiveness of logistic regression models in predicting binary outcomes.
# Function to compute accuracy, precision, recall, and F1 score
performanceMeasures <- function(pred, truth, name = "model") {
ctable <- table(truth = truth, pred = (pred > 0.5))
accuracy <- sum(diag(ctable)) / sum(ctable)
precision <- ctable[2, 2] / sum(ctable[, 2])
recall <- ctable[2, 2] / sum(ctable[2, ])
f1 <- 2 * precision * recall / (precision + recall)
data.frame(model = name, precision = precision,
recall = recall,
f1 = f1, accuracy = accuracy)
}
# Function to generate a well-formatted table comparing model performance on training and test datasets
pretty_perf_table <- function(model,training,test) {
library(pander)
# setting up Pander Options
panderOptions("plain.ascii", TRUE)
panderOptions("keep.trailing.zeros", TRUE)
panderOptions("table.style", "simple")
perf_justify <- "lrrrr"
# comparing performance on training vs. test
pred_train <- predict(model, newdata=dTrain, type="response")
truth_train <- dTrain[, "status"]
pred_test <- predict(model, newdata=dTest, type="response")
truth_test <- dTest[, "status"]
trainperf_tree <- performanceMeasures(
pred_train, truth_train, "logistic, training")
testperf_tree <- performanceMeasures(
pred_test, truth_test, "logistic, test")
perftable <- rbind(trainperf_tree, testperf_tree)
pandoc.table(perftable, justify = perf_justify)
}
# Function to calculate and display a confusion matrix to visualize the model's classification results
confusion_matrix <- function(ytrue, ypred) {
ctable <- table(truth = ytrue, prediction = (ypred > 0.5))
print(ctable)
}
# Function to compute AUC
library(ROCR) # provides prediction() and performance()
calculate_auc <- function(model, data, outcome_colname) {
predictions <- predict(model, newdata=data, type="response")
pred <- prediction(predictions, data[[outcome_colname]])
perf <- performance(pred, "auc")
auc <- as.numeric(perf@y.values[[1]])
return(auc)
}
pretty_perf_table(model.boruta, dTrain, dTest)
##
## Attaching package: 'pander'
## The following object is masked from 'package:shiny':
##
## p
##
##
## model precision recall f1 accuracy
## -------------------- ----------- -------- -------- ----------
## logistic, training 0.7895 0.5674 0.6602 0.6973
## logistic, test 0.7568 0.6222 0.6829 0.7320
pretty_perf_table(model.fisher, dTrain, dTest)
##
##
## model precision recall f1 accuracy
## -------------------- ----------- -------- -------- ----------
## logistic, training 0.7954 0.5697 0.6639 0.7010
## logistic, test 0.7500 0.6000 0.6667 0.7216
While both the Boruta-selected and Fisher's Score-selected models show only moderate performance, the disparities between their evaluation metrics are minimal. Each achieves a test-set accuracy of roughly 72-73%, underscoring their ability to classify unseen data, and the model employing the Boruta-selected features is slightly ahead at 73%. The precision scores for both models are notably high, suggesting accurate positive-class predictions with only a few false positives. In terms of recall, both models show moderate performance, correctly identifying a significant portion of the positive class, as shown by the recall values of 0.62 and 0.60 on the test set. The F1 scores, the harmonic mean of precision and recall, sit around 0.66 to 0.68.
confusion_matrix(dTest$status, dTest$logpred1 > 0.5)
## prediction
## truth FALSE TRUE
## 0 32 20
## 1 13 32
confusion_matrix(dTest$status, dTest$logpred2 > 0.5)
## prediction
## truth FALSE TRUE
## 0 34 18
## 1 13 32
The confusion matrices above tell us that both models have a similar number of true positives, true negatives, false positives, and false negatives. This implies that the models correctly classify a similar number of positive and negative instances, indicating a balanced classification capability. The number of false positives is relatively low, which aligns with the high precision scores mentioned earlier.
auc_train <- calculate_auc(model.boruta, dTrain, "status")
auc_test <- calculate_auc(model.boruta, dTest, "status")
print(paste("AUC for training data: ", round(auc_train, 4), "; AUC for test data: ", round(auc_test, 4)))
## [1] "AUC for training data: 0.7987 ; AUC for test data: 0.7654"
auc_train <- calculate_auc(model.fisher, dTrain, "status")
auc_test <- calculate_auc(model.fisher, dTest, "status")
print(paste("AUC for training data: ", round(auc_train, 4), "; AUC for test data: ", round(auc_test, 4)))
## [1] "AUC for training data: 0.7975 ; AUC for test data: 0.762"
The AUC score of the test data for both models is slightly lower than the AUC score of the training data, which is an indication that the model may be slightly overfitting the training data. However, the difference is not substantial, indicating that the model’s performance remains reasonably consistent when applied to unseen data. Notably, both the training and test AUC scores are well above the threshold of 0.5, indicating that the model has discriminatory power and can effectively distinguish between the positive and negative classes.
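The AUC comparison can also be visualized with ROC curves. A sketch, assuming the ROCR package (already used by calculate_auc) is available:

```r
# Overlay test-set ROC curves for the two logistic models
# (sketch; assumes dTest$logpred1 / dTest$logpred2 computed above)
library(ROCR)
pred1 <- prediction(dTest$logpred1, dTest$status)
pred2 <- prediction(dTest$logpred2, dTest$status)
plot(performance(pred1, "tpr", "fpr"), col = "blue",
     main = "ROC curves on the test set")
plot(performance(pred2, "tpr", "fpr"), col = "red", add = TRUE)
abline(0, 1, lty = 2)  # chance-level reference
legend("bottomright", legend = c("Boruta", "Fisher"),
       col = c("blue", "red"), lty = 1)
```

Curves bowing toward the top-left corner, well above the dashed diagonal, correspond to the AUC values near 0.76 reported above.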
In this section, we will explore the capabilities of XGBoost, a gradient boosting algorithm, and LIME, a tool for local model interpretation, providing concise explanations and practical use cases for each.
library(xgboost)
# Input:
# - variable_matrix: matrix of input data
# - labelvec: numeric vector of class labels (1 is positive class)
# Returns:
# - xgboost model
xgb = function(variable_matrix, labelvec) {
cv <- xgb.cv(variable_matrix, label = labelvec,
params=list(
objective="binary:logistic"
),
nfold=5,
nrounds=100,
print_every_n=10,
metrics="logloss")
evalframe <- as.data.frame(cv$evaluation_log)
NROUNDS <- which.min(evalframe$test_logloss_mean)
model <- xgboost(data=variable_matrix, label=labelvec,
params=list(
objective="binary:logistic"
),
nrounds=NROUNDS,
verbose=FALSE)
model
}
The xgb function trains an XGBoost model for binary classification, using 5-fold cross-validation to choose the number of boosting rounds: it runs up to 100 rounds and selects the round with the lowest mean test log loss. During each round, the model corrects its previous errors and updates its predictions based on the knowledge accumulated from earlier rounds, so the training log loss decreases progressively, while the test log loss decreases at first and then rises once the model begins to overfit.
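The round-selection logic inside xgb can be visualized by plotting the cross-validated log loss curves. A sketch mirroring the calls inside xgb() (the label conversion anticipates xgboost's requirement for numeric labels):

```r
# Sketch: visualize where the cross-validated test log loss bottoms out,
# mirroring the logic inside the xgb() function above
library(xgboost)
library(ggplot2)
labelvec <- as.numeric(as.character(dTrain$status))  # xgboost needs numeric labels
cv <- xgb.cv(data = as.matrix(dTrain[boruta.vars]), label = labelvec,
             params = list(objective = "binary:logistic"),
             nfold = 5, nrounds = 100, verbose = FALSE, metrics = "logloss")
evalframe <- as.data.frame(cv$evaluation_log)
best <- which.min(evalframe$test_logloss_mean)
ggplot(evalframe, aes(x = iter)) +
  geom_line(aes(y = train_logloss_mean), color = "blue") +
  geom_line(aes(y = test_logloss_mean), color = "red") +
  geom_vline(xintercept = best, linetype = "dashed") +
  labs(x = "Boosting round", y = "Mean log loss",
       title = "Train (blue) vs. cross-validated test (red) log loss")
```

The dashed line marks the round that xgb() passes to xgboost() as nrounds.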
dTrain$status <- as.numeric(as.character(dTrain$status)) # xgboost requires numeric labels
input <- as.matrix(dTrain[boruta.vars])
model <- xgb(input, dTrain$status)
## [1] train-logloss:0.530653+0.002449 test-logloss:0.546941+0.006926
## [11] train-logloss:0.176272+0.003009 test-logloss:0.289694+0.022813
## [21] train-logloss:0.106269+0.002478 test-logloss:0.280672+0.028782
## [31] train-logloss:0.069780+0.001416 test-logloss:0.291730+0.029878
## [41] train-logloss:0.049826+0.000851 test-logloss:0.298976+0.031276
## [51] train-logloss:0.038868+0.001246 test-logloss:0.306169+0.035693
## [61] train-logloss:0.031483+0.000810 test-logloss:0.316650+0.039309
## [71] train-logloss:0.026763+0.000815 test-logloss:0.323084+0.042292
## [81] train-logloss:0.023415+0.000915 test-logloss:0.328537+0.046447
## [91] train-logloss:0.021090+0.000909 test-logloss:0.335342+0.046107
## [100] train-logloss:0.019337+0.000734 test-logloss:0.339142+0.047221
The model achieves a cross-validated test log loss of approximately 0.28 to 0.34, with its minimum among the printed rounds at round 21 (0.281), demonstrating its ability to make accurate probabilistic predictions. Beyond that point the test log loss creeps upward while the training log loss keeps falling, which is exactly why the xgb function selects the round with the lowest test log loss.
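For reference, log loss can be computed directly from predicted probabilities. A small self-contained sketch of the metric XGBoost minimizes here:

```r
# logloss = -mean(y * log(p) + (1 - y) * log(1 - p))
log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
log_loss(c(1, 0, 1), c(0.9, 0.2, 0.8))  # confident, correct predictions -> small loss
```

Confident but wrong predictions (e.g. p near 1 when y = 0) are penalized very heavily, which is why the clipping step is needed.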
The following code uses the lime function to build an explainer, which takes the feature columns of the training dataset, the model, and puts the continuous variables into 10 bins when making explanations.
library(lime)
##
## Attaching package: 'lime'
## The following object is masked from 'package:dplyr':
##
## explain
explainer <- lime(dTrain[boruta.vars], model = model,
bin_continuous = TRUE, n_bins = 10)
## Warning: subscribers_for_last_30_days_isBAD does not contain enough variance to
## use quantile binning. Using standard binning instead.
## Warning: video_views_for_the_last_30_days_isBAD does not contain enough
## variance to use quantile binning. Using standard binning instead.
cases <- c(3,11,24,49)
(example <- dTest[cases,boruta.vars])
## uploads Country video_views_for_the_last_30_days
## 30 22043.552 47 617638209
## 110 25047.792 17 180374299
## 252 4707.735 47 19409274
## 546 2301.165 47 175430062
## subscribers_for_last_30_days created_year
## 30 128950.13 5
## 110 152343.48 8
## 252 -97591.25 4
## 546 295223.40 15
## subscribers_for_last_30_days_isBAD video_views_for_the_last_30_days_isBAD
## 30 1 1
## 110 1 1
## 252 1 1
## 546 2 1
explanation <- lime::explain(example, explainer, n_labels = 1, n_features = 7)
ggsave("plot1.png", plot = plot_features(explanation), width = 12, height = 6)
From the results of these randomly selected cases, the LIME plot illustrates how different features influence the model's decision, highlighting that intuitively positive metrics such as high views or subscribers do not always support a high-earner prediction. Case 30, for example, is classified as a high earner with 85% probability, with uploads and video_views_for_the_last_30_days_isBAD supporting the high-earner classification, while its contradicting features (a significant number of recent video views, new subscribers, its country, in this case the United States, and the channel's age) weigh against the high-earner status. Overall, we can see from the plot that uploads is the determining feature for the classification of these four instances; the rest of the features are less significant.
input <- as.matrix(dTrain[fisher.vars])
model <- xgb(input, dTrain$status)
## [1] train-logloss:0.532536+0.002052 test-logloss:0.552204+0.006971
## [11] train-logloss:0.174255+0.007759 test-logloss:0.303433+0.019853
## [21] train-logloss:0.105810+0.008274 test-logloss:0.288477+0.038694
## [31] train-logloss:0.068876+0.004367 test-logloss:0.291952+0.047568
## [41] train-logloss:0.049067+0.002822 test-logloss:0.305477+0.059234
## [51] train-logloss:0.037897+0.002131 test-logloss:0.315562+0.062005
## [61] train-logloss:0.030824+0.001700 test-logloss:0.322023+0.066567
## [71] train-logloss:0.025966+0.001228 test-logloss:0.326077+0.069598
## [81] train-logloss:0.022798+0.001031 test-logloss:0.334768+0.071992
## [91] train-logloss:0.020458+0.000954 test-logloss:0.341398+0.073774
## [100] train-logloss:0.018802+0.000758 test-logloss:0.345324+0.075844
The model achieves a cross-validated test log loss of approximately 0.29 to 0.35, slightly higher at most rounds than that of the XGBoost model trained on the Boruta-selected features. While slightly higher, the log loss is still relatively low, indicating that this model also makes reasonably accurate probabilistic predictions.
explainer <- lime(dTrain[fisher.vars], model = model,
bin_continuous = TRUE, n_bins = 10)
## Warning: subscribers_for_last_30_days_isBAD does not contain enough variance to
## use quantile binning. Using standard binning instead.
## Warning: video_views_for_the_last_30_days_isBAD does not contain enough
## variance to use quantile binning. Using standard binning instead.
cases <- c(3,11,24,49)
(example <- dTest[cases,fisher.vars])
## video.views uploads Country video_views_for_the_last_30_days
## 30 2638870777 22043.552 47 617638209
## 110 4126755625 25047.792 17 180374299
## 252 11488891033 4707.735 47 19409274
## 546 48990569078 2301.165 47 175430062
## subscribers_for_last_30_days subscribers_for_last_30_days_isBAD
## 30 128950.13 1
## 110 152343.48 1
## 252 -97591.25 1
## 546 295223.40 2
## video_views_for_the_last_30_days_isBAD
## 30 1
## 110 1
## 252 1
## 546 1
explanation <- lime::explain(example, explainer, n_labels = 1, n_features = 7)
ggsave("plot2.png", plot = plot_features(explanation), width = 12, height = 6)
Similar to the LIME plot generated from the first feature combination identified by Boruta, the feature set derived from Fisher's Score produced comparable results.
Classification is a supervised learning technique used to categorize data points into predefined labels based on their features. Its purpose is to predict the predefined target label of a data point based on its characteristics, utilizing labeled training data to understand the relationships between different categories.
Our exploration into the classification of the given dataset provided
significant insights. Starting from analyzing single variable models, it
was evident that the features uploads and
subscribers for the last 30 days stood out in performance
as evidenced by their AUC scores and the ROC plot analysis. This
observation was reinforced when we delved into multivariable
classification models.
When we integrated the feature combinations generated by Boruta and Fisher's score into different models, the results were highly similar. Specifically, in the logistic regression models, the roughly 1% difference in accuracy and the similar AUC scores between the two feature combinations suggest consistent predictive power from the variables shared by both. Meanwhile, the XGBoost model performed marginally better with the Boruta feature set than with the Fisher's set, but the differences are not large enough to definitively favor one over the other. The minimal discrepancy is likely because the two combinations differ by only a single variable.
However, it is the Decision Tree model that offered a notable
revelation. While both feature combinations produced almost identical
trees, the standout finding was that just two features,
uploads and subscribers for the last 30 days,
were enough to yield an accuracy rate of 92%. This suggests that the
other predictors might not be as influential.
Given the above findings, it becomes evident that the Decision Tree model, specifically when reduced to the two aforementioned significant predictors, emerges as the most efficient choice for classification in this scenario. The ability of the Decision Tree to achieve these results with fewer predictors highlights its efficiency and potential for scalability in future applications.
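The reduced two-predictor model described above corresponds to the commented-out variable set in the decision tree chunk earlier. A minimal sketch of refitting and evaluating it, reusing dTrain, dTest, and calculate_accuracy() from above:

```r
# Refit the decision tree on only the two dominant predictors
# (sketch; reuses objects defined earlier in this report)
dt.small <- rpart(as.factor(status) ~ uploads + subscribers_for_last_30_days,
                  data = dTrain)
pred.small <- predict(dt.small, newdata = dTest, type = "class")
calculate_accuracy(table(dTest$status, pred.small))
rpart.plot(dt.small)
```

The exact accuracy depends on the train/test split, but it should sit close to the 92% figure reported above.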
We aim to cluster YouTubers based on their content creation habits, user engagement, and viewership statistics. We do not need a target variable for clustering. Our dataset consists of both numeric and categorical features, which we will convert into numeric values. Since these features have different scales, we will standardize them before using Euclidean Distance as the similarity measure for our clustering analysis.
As part of the pre-processing step, we will clean and prepare the YouTuber names, as we intend to use them for visualizing the results of the clustering analysis.
df_gdp[grepl("^ý.*ý$", df_gdp$Youtuber), ]
There are 18 invalid Youtuber names (consisting only of "ý" symbols). To ensure the integrity of our data and a meaningful analysis, we will exclude these invalid names from the dataset. Some Youtuber names still contain the "ý" and "�" symbols; to improve readability, we will replace these symbols with blanks.
df_gdp <- df_gdp %>%
filter(!grepl("^ý.*ý$", Youtuber)) %>%
mutate(Youtuber = gsub("ý", "", Youtuber)) %>%
mutate(Youtuber = gsub("�", "", Youtuber))
# Convert the Country column to a numeric column
country_mapping <- c(
"Afghanistan" = 1, "Andorra" = 2, "Argentina" = 3, "Australia" = 4,
"Bangladesh" = 5, "Barbados" = 6, "Brazil" = 7, "Canada" = 8,
"Chile" = 9, "China" = 10, "Colombia" = 11, "Cuba" = 12,
"Ecuador" = 13, "Egypt" = 14, "El Salvador" = 15, "Finland" = 16,
"France" = 17, "Germany" = 18, "India" = 19, "Indonesia" = 20,
"Iraq" = 21, "Italy" = 22, "Japan" = 23, "Jordan" = 24,
"Kuwait" = 25, "Latvia" = 26, "Malaysia" = 27, "Mexico" = 28,
"Morocco" = 29, "Netherlands" = 30, "Pakistan" = 31, "Peru" = 32,
"Philippines" = 33, "Russia" = 34, "Samoa" = 35, "Saudi Arabia" = 36,
"Singapore" = 37, "South Korea" = 38, "Spain" = 39, "Sweden" = 40,
"Switzerland" = 41, "Thailand" = 42, "Turkey" = 43, "Ukraine" = 44,
"United Arab Emirates" = 45, "United Kingdom" = 46, "United States" = 47,
"Venezuela" = 48, "Vietnam" = 49)
df_gdp$Country <- country_mapping[df_gdp$Country]
# Convert the category column to a numeric column
category_mapping <- c(
"Autos & Vehicles" = 1, "Comedy" = 2, "Education" = 3, "Entertainment" = 4,
"Film & Animation" = 5, "Gaming" = 6, "Howto & Style" = 7, "Movies" = 8,
"Music" = 9, "News & Politics" = 10, "Nonprofits & Activism" = 11, "People & Blogs" = 12,
"Pets & Animals" = 13, "Science & Technology" = 14, "Shows" = 15, "Sports" = 16,
"Trailers" = 17, "Travel & Events" = 18)
df_gdp$category <- category_mapping[df_gdp$category]
df_gdp <- df_gdp %>%
mutate(created_year = as.numeric(created_year)) %>%
mutate(subscribers_for_last_30_days_isBAD = as.numeric(subscribers_for_last_30_days_isBAD)) %>%
mutate(video_views_for_the_last_30_days_isBAD = as.numeric(video_views_for_the_last_30_days_isBAD))
df.clust <- df_gdp[, -c(1:2, 8, 15:16)] %>% # Exclude irrelevant columns
scale() %>% # Scale data
as.data.frame()
We will compare the behavior of the complete, ward.D2, and single linkage methods for grouping YouTubers into clusters.
set.seed(123)
random_data <- df.clust[sample(nrow(df.clust), 100), ]
d.r <- dist(random_data, method = "euclidean")
pfit.c.r <- hclust(d.r, method = "complete")
pfit.w.r <- hclust(d.r, method = "ward.D2")
pfit.s.r <- hclust(d.r, method = "single")
plot(pfit.c.r, main="Cluster Dendrogram for Complete Linkage", cex = 0.6)
rect.hclust(pfit.c.r, k=3)
plot(pfit.w.r, main="Cluster Dendrogram for Ward Linkage", cex = 0.6)
rect.hclust(pfit.w.r, k=3)
plot(pfit.s.r, main="Cluster Dendrogram for Single Linkage", cex = 0.6)
rect.hclust(pfit.s.r, k=3)
Dendrograms drawn from these random samples of the data provide limited insight and are challenging to interpret.
d <- dist(df.clust, method = "euclidean")
pfit.c <- hclust(d, method = "complete")
pfit.w <- hclust(d, method = "ward.D2")
pfit.s <- hclust(d, method = "single")
k <- 1:5
groups.c <- cutree(pfit.c, k = k)
groups.w <- cutree(pfit.w, k = k)
groups.s <- cutree(pfit.s, k = k)
groups.c.fp = as.data.frame(groups.c) %>%
pivot_longer(cols = k, names_to = "cluster", values_to = "level")
conf.c = xtabs(~level + cluster, groups.c.fp)
groups.w.fp = as.data.frame(groups.w) %>%
pivot_longer(cols = k, names_to = "cluster", values_to = "level")
conf.w = xtabs(~level + cluster, groups.w.fp)
groups.s.fp = as.data.frame(groups.s) %>%
pivot_longer(cols = k, names_to = "cluster", values_to = "level")
conf.s = xtabs(~level + cluster, groups.s.fp)
cat("Complete Linkage: \n")
## Complete Linkage:
conf.c
## cluster
## level 1 2 3 4 5
## 1 968 11 3 3 3
## 2 0 957 957 955 955
## 3 0 0 8 8 1
## 4 0 0 0 2 7
## 5 0 0 0 0 2
cat("Ward.D2 Linkage: \n")
## Ward.D2 Linkage:
conf.w
## cluster
## level 1 2 3 4 5
## 1 968 13 13 13 13
## 2 0 955 630 630 596
## 3 0 0 325 281 34
## 4 0 0 0 44 281
## 5 0 0 0 0 44
cat("Single Linkage: \n")
## Single Linkage:
conf.s
## cluster
## level 1 2 3 4 5
## 1 968 966 965 1 1
## 2 0 2 1 964 963
## 3 0 0 2 1 1
## 4 0 0 0 2 2
## 5 0 0 0 0 1
Since the dendrograms are challenging to interpret, we have instead tabulated the number of observations in each cluster. The three linkage methods exhibit distinct characteristics: single linkage peels off isolated observations one at a time, leaving a single dominant cluster, while ward.D2 minimizes the variance within clusters and produces more evenly sized groups.
k <- 5
groups.c <- cutree(pfit.c, k = k)
groups.w <- cutree(pfit.w, k = k)
groups.s <- cutree(pfit.s, k = k)
print_clusters <- function(df, groups, cols_to_print) {
Ngroups <- max(groups)
for (i in 1:Ngroups) {
print(paste("cluster", i))
print(df[groups == i, cols_to_print])
}
}
cols_to_print <- c("Youtuber", "subscribers", "highest_yearly_earnings", "Country")
print_clusters(df_gdp, groups.c, cols_to_print)
For hierarchical clustering using complete linkage, we can observe that level 1 contains the same 3 observations at k = 3, 4, and 5 clusters. We can utilize the print_clusters() function to identify these observations: T-Series, Cocomelon - Nursery Rhymes, and SET India. We might assume that this group is based on a high number of subscribers, high yearly earnings, and being from India. However, this interpretation is for exploratory purposes only, and clustering interpretation is highly subjective.
princ <- prcomp(df.clust)
nComp <- 2
project2D <- as.data.frame(predict(princ, newdata=df.clust)[, 1:nComp])
hclust.c.project2D <- cbind(project2D, cluster=as.factor(groups.c), Youtuber=df_gdp$Youtuber)
hclust.w.project2D <- cbind(project2D, cluster=as.factor(groups.w), Youtuber=df_gdp$Youtuber)
hclust.s.project2D <- cbind(project2D, cluster=as.factor(groups.s), Youtuber=df_gdp$Youtuber)
# Finding convex hull
library(grDevices)
find_convex_hull <- function(proj2Ddf, groups) {
do.call(rbind,
lapply(unique(groups),
FUN = function(c) {
f <- subset(proj2Ddf, cluster==c);
f[chull(f),]
}
)
)
}
hclust.c.hull <- find_convex_hull(hclust.c.project2D, groups.c)
hclust.w.hull <- find_convex_hull(hclust.w.project2D, groups.w)
hclust.s.hull <- find_convex_hull(hclust.s.project2D, groups.s)
ggplot(hclust.c.project2D, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=cluster, color=cluster)) +
geom_text(aes(label=Youtuber, color=cluster), hjust=0, vjust=1, size=3) +
geom_polygon(data=hclust.c.hull, aes(group=cluster, fill=as.factor(cluster)), alpha=0.4, linetype=0) +
ggtitle("Complete Linkage")
ggplot(hclust.w.project2D, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=cluster, color=cluster)) +
geom_text(aes(label=Youtuber, color=cluster), hjust=0, vjust=1, size=3) +
geom_polygon(data=hclust.w.hull, aes(group=cluster, fill=as.factor(cluster)), alpha=0.4, linetype=0) +
ggtitle("Ward’s Linkage")
ggplot(hclust.s.project2D, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=cluster, color=cluster)) +
geom_text(aes(label=Youtuber, color=cluster), hjust=0, vjust=1, size=3) +
geom_polygon(data=hclust.s.hull, aes(group=cluster, fill=as.factor(cluster)), alpha=0.4, linetype=0) +
ggtitle("Single Linkage")
When comparing complete and Ward linkage methods, it is observed that Ward linkage tends to form larger clusters, whereas complete linkage creates a combination of very small and very large clusters.
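Before reading too much into these 2-D projections, it is worth checking how much of the total variance PC1 and PC2 actually capture. A quick sketch using princ from the prcomp() call above:

```r
# How much variance do the first few principal components capture?
# (sketch; assumes princ from the prcomp() call above)
var_explained <- princ$sdev^2 / sum(princ$sdev^2)
round(cumsum(var_explained)[1:5], 3)  # cumulative proportion for PC1..PC5
screeplot(princ, type = "lines", main = "Scree plot")
```

If the first two components explain only a modest share of the variance, apparent overlap between cluster hulls in the plots may be an artifact of the projection rather than of the clustering itself.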
library(fpc)
kbest.p <- 12
cboot.hclust.c <- clusterboot(df.clust, clustermethod = hclustCBI, method = "complete", k = kbest.p)
## boot 1
## boot 2
## ...
## boot 100
groups.cboot.c <- cboot.hclust.c$result$partition
cboot.hclust.w <- clusterboot(df.clust, clustermethod = hclustCBI, method = "ward.D2", k = kbest.p)
## boot 1
## boot 2
## ...
## boot 100
groups.cboot.w <- cboot.hclust.w$result$partition
cat("Complete Linkage: \n")
## Complete Linkage:
(values <- 1 - cboot.hclust.c$bootbrd/100)
## [1] 0.88 0.34 0.54 0.17 0.62 1.00 0.95 0.69 0.30 0.84 0.50 0.53
cat("So clusters", order(values)[12], "and", order(values)[11], "are highly stable \n")
## So clusters 6 and 7 are highly stable
cat("Ward's Linkage: \n")
## Ward's Linkage:
(values <- 1 - cboot.hclust.w$bootbrd/100)
## [1] 0.66 0.40 0.65 0.59 0.86 0.76 0.29 0.72 1.00 1.00 0.94 0.83
cat("So clusters", order(values)[12], "and", order(values)[11], "are highly stable \n")
## So clusters 10 and 9 are highly stable
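Rather than ranking the stability values with order(), the same question can be asked with a threshold; the 0.8 cutoff below is an illustrative assumption, not a value used elsewhere in this analysis:

```r
# Fraction of the 100 bootstrap runs in which each cluster was NOT dissolved
# (clusterboot's bootbrd counts dissolutions); 0.8 is an illustrative cutoff.
stability <- 1 - cboot.hclust.c$bootbrd / 100
which(stability >= 0.8)  # clusters that survived at least 80% of resamples
```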
# Function to return the squared Euclidean distance of two given points x and y
sqr_euDist <- function(x, y) {
sum((x - y)^2)
}
# Function to calculate WSS of a cluster
wss <- function(clustermat) {
c0 <- colMeans(clustermat)
sum(apply( clustermat, 1, FUN=function(row) {sqr_euDist(row, c0)} ))
}
# Function to calculate the total WSS
wss_total <- function(scaled_df, labels) {
wss.sum <- 0
k <- length(unique(labels))
for (i in 1:k)
wss.sum <- wss.sum + wss(subset(scaled_df, labels == i))
wss.sum
}
# Function to calculate total sum of squared (TSS)
tss <- function(scaled_df) {
wss(scaled_df)
}
# Function to return the CH indices computed using complete linkage
CH_index.c <- function(scaled_df, kmax, method="kmeans") {
if (!(method %in% c("kmeans", "hclust")))
stop("method must be one of c('kmeans', 'hclust')")
npts <- nrow(scaled_df)
wss.value <- numeric(kmax) # create a vector of numeric type
# wss.value[1] stores the WSS value for k=1 (when all the
# data points form 1 large cluster).
wss.value[1] <- wss(scaled_df)
if (method == "kmeans") {
# kmeans
for (k in 2:kmax) {
clustering <- kmeans(scaled_df, k, nstart=10, iter.max=100)
wss.value[k] <- clustering$tot.withinss
}
} else {
# hclust
d <- dist(scaled_df, method="euclidean")
pfit <- hclust(d, method="complete")
for (k in 2:kmax) {
labels <- cutree(pfit, k=k)
wss.value[k] <- wss_total(scaled_df, labels)
}
}
bss.value <- tss(scaled_df) - wss.value # this is a vector
B <- bss.value / (0:(kmax-1)) # also a vector
W <- wss.value / (npts - 1:kmax) # also a vector
data.frame(k = 1:kmax, CH_index = B/W, WSS = wss.value)
}
# Function to return the CH indices computed using ward.D2 linkage
CH_index.w <- function(scaled_df, kmax, method="kmeans") {
if (!(method %in% c("kmeans", "hclust")))
stop("method must be one of c('kmeans', 'hclust')")
npts <- nrow(scaled_df)
wss.value <- numeric(kmax) # create a vector of numeric type
# wss.value[1] stores the WSS value for k=1 (when all the
# data points form 1 large cluster).
wss.value[1] <- wss(scaled_df)
if (method == "kmeans") {
# kmeans
for (k in 2:kmax) {
clustering <- kmeans(scaled_df, k, nstart=10, iter.max=100)
wss.value[k] <- clustering$tot.withinss
}
} else {
# hclust
d <- dist(scaled_df, method="euclidean")
pfit <- hclust(d, method="ward.D2")
for (k in 2:kmax) {
labels <- cutree(pfit, k=k)
wss.value[k] <- wss_total(scaled_df, labels)
}
}
bss.value <- tss(scaled_df) - wss.value # this is a vector
B <- bss.value / (0:(kmax-1)) # also a vector
W <- wss.value / (npts - 1:kmax) # also a vector
data.frame(k = 1:kmax, CH_index = B/W, WSS = wss.value)
}
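Both helper functions implement the Calinski-Harabasz index. With n points and BSS(k) = TSS - WSS(k), the quantity returned in the CH_index column is

$$\mathrm{CH}(k) = \frac{\mathrm{BSS}(k) / (k - 1)}{\mathrm{WSS}(k) / (n - k)}$$

Note that for k = 1 the denominator k - 1 is zero, so the first CH value is not finite; this is the single point ggplot later reports as removed.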
# complete linkage
crit.df <- CH_index.c(df.clust, 12, method = "hclust")
fig1 <- ggplot(crit.df, aes(x=k, y=CH_index)) +
geom_point() + geom_line(colour="red") +
scale_x_continuous(breaks=1:12, labels=1:12) +
labs(y="CH index")
fig2 <- ggplot(crit.df, aes(x=k, y=WSS), color="blue") +
geom_point() + geom_line(colour="blue") +
scale_x_continuous(breaks=1:12, labels=1:12)
grid.arrange(fig1, fig2, nrow=1)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).
# ward.D2 linkage
crit.df <- CH_index.w(df.clust, 12, method = "hclust")
fig1 <- ggplot(crit.df, aes(x=k, y=CH_index)) +
geom_point() + geom_line(colour="red") +
scale_x_continuous(breaks=1:12, labels=1:12) +
labs(y="CH index")
fig2 <- ggplot(crit.df, aes(x=k, y=WSS), color="blue") +
geom_point() + geom_line(colour="blue") +
scale_x_continuous(breaks=1:12, labels=1:12)
grid.arrange(fig1, fig2, nrow=1)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Removed 1 row containing missing values (`geom_line()`).
When using hierarchical clustering with the Euclidean distance metric and the complete linkage method, we find that the CH index is maximized at k = 2, while the ward.D2 linkage method results in k = 10. The WSS plot does not exhibit a clear elbow point.
kmClustering.ch <- kmeansruns(df.clust, krange = 1:12, criterion = "ch")
kmClustering.ch$bestk
## [1] 2
kmClustering.asw <- kmeansruns(df.clust, krange = 1:12, criterion = "asw")
kmClustering.asw$bestk
## [1] 2
kmCritframe <- data.frame(k=1:12, ch=kmClustering.ch$crit, asw=kmClustering.asw$crit)
fig1 <- ggplot(kmCritframe, aes(x=k, y=ch)) +
geom_point() + geom_line(colour="red") +
scale_x_continuous(breaks=1:12, labels=1:12) +
labs(y="CH index")
fig2 <- ggplot(kmCritframe, aes(x=k, y=asw)) +
geom_point() + geom_line(colour="blue") +
scale_x_continuous(breaks=1:12, labels=1:12) +
labs(y="ASW")
grid.arrange(fig1, fig2, nrow=1)
Using k-means clustering, both the CH index and ASW are maximized at k = 2 clusters.
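The ASW criterion used by kmeansruns is the average silhouette width. For point i, with a(i) the mean distance to points in its own cluster and b(i) the mean distance to the nearest other cluster:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad \mathrm{ASW} = \frac{1}{n}\sum_{i=1}^{n} s(i)$$

Values of s(i) near 1 indicate a point sitting well inside a compact, well-separated cluster, so the k that maximizes ASW balances cohesion against separation.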
kvalues <- seq(2, 5)
for (k in kvalues) {
set.seed(3064)
groups <- kmeans(df.clust, k, nstart = 100, iter.max = 100)$cluster
kmclust.project2D <- cbind(project2D, cluster = as.factor(groups), Youtuber = df_gdp$Youtuber)
kmclust.hull <- find_convex_hull(kmclust.project2D, groups)
assign(paste0("fig", k),
ggplot(kmclust.project2D, aes(x = PC1, y = PC2)) +
geom_point(aes(shape = cluster, color = cluster)) +
geom_polygon(data = kmclust.hull, aes(group = cluster, fill = cluster),
alpha = 0.4, linetype = 0) +
labs(title = sprintf("k = %d", k)) +
theme(legend.position ="none")
)
}
grid.arrange(fig2, fig3, fig4, fig5, nrow = 2)
From these PCA plots, k-means clustering consistently keeps a single cluster on the left-hand side; as k increases, it is the cluster on the right-hand side that splits into subclusters.
k_values <- 2:5
# Use a distinct name so we do not shadow base::table()
size_table <- matrix(0, nrow = length(k_values), ncol = max(k_values))
rownames(size_table) <- k_values
colnames(size_table) <- 1:max(k_values)
for (k_index in seq_along(k_values)) {
set.seed(3064)
k <- k_values[k_index]
kmeans_result <- kmeans(df.clust, centers = k, nstart = 100, iter.max = 100)
cluster_sizes <- table(kmeans_result$cluster)
for (level in 1:k) {
size_table[k_index, level] <- cluster_sizes[level]
}
}
cat("The number of observations in each cluster using k-means clustering \n")
## The number of observations in each cluster using k-means clustering
t(size_table)
## 2 3 4 5
## 1 57 599 292 46
## 2 911 31 599 31
## 3 0 338 31 328
## 4 0 0 46 271
## 5 0 0 0 292
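The nested loop above can be condensed into a single sapply over k; a sketch using the same seed and data, where tabulate() zero-pads cluster counts beyond the current k:

```r
# One kmeans fit per k; each column of `sizes` holds the cluster sizes for
# that k, padded with zeros up to the maximum number of clusters (5).
sizes <- sapply(k_values, function(k) {
set.seed(3064)
cl <- kmeans(df.clust, centers = k, nstart = 100, iter.max = 100)$cluster
tabulate(cl, nbins = max(k_values))
})
colnames(sizes) <- k_values
sizes  # rows = cluster label, columns = k
```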
We calculate the mean of each feature within each cluster to gain insights into the relationships between them.
set.seed(3064)
km <- kmeans(df.clust, 5, nstart = 100, iter.max = 100)
df_with_clusters <- df.clust %>% mutate(Cluster = km$cluster)
cluster_means <- df_with_clusters %>% group_by(Cluster) %>% summarise_all(mean)
ggplot(cluster_means) +
geom_line(aes(x = Cluster, y = subscribers, linetype = "Subscribers", color = "Subscribers")) +
geom_line(aes(x = Cluster, y = video.views, linetype = "Video Views", color = "Video Views")) +
geom_line(aes(x = Cluster, y = uploads, linetype = "Uploads", color = "Uploads")) +
geom_line(aes(x = Cluster, y = video_views_for_the_last_30_days, linetype = "Video Views Last 30 Days", color = "Video Views Last 30 Days")) +
geom_line(aes(x = Cluster, y = highest_yearly_earnings, linetype = "Highest Yearly Earnings", color = "Highest Yearly Earnings")) +
geom_line(aes(x = Cluster, y = subscribers_for_last_30_days, linetype = "Subscribers Last 30 Days", color = "Subscribers Last 30 Days")) +
scale_x_continuous(breaks = 1:5, labels = 1:5) +
labs(title = "Mean of each feature by cluster", x = "cluster", y = "mean") +
scale_color_manual(values = c("Subscribers" = "blue", "Video Views" = "green", "Uploads" = "red", "Video Views Last 30 Days" = "purple", "Highest Yearly Earnings" = "orange", "Subscribers Last 30 Days" = "black")) +
scale_linetype_manual(values = c("Subscribers" = "solid", "Video Views" = "dashed", "Uploads" = "dotted", "Video Views Last 30 Days" = "longdash", "Highest Yearly Earnings" = "twodash", "Subscribers Last 30 Days" = "solid"))
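The six repeated geom_line calls above can be collapsed by reshaping the cluster means to long format first; a sketch assuming tidyr (loaded via tidyverse) is available, at the cost of ggplot's default colours and linetypes rather than the manual scales above:

```r
# Reshape to one row per (cluster, feature) pair, then map feature to
# colour and linetype in a single geom_line layer.
cluster_means %>%
pivot_longer(-Cluster, names_to = "feature", values_to = "mean") %>%
ggplot(aes(x = Cluster, y = mean, colour = feature, linetype = feature)) +
geom_line() +
scale_x_continuous(breaks = 1:5, labels = 1:5) +
labs(title = "Mean of each feature by cluster", x = "cluster", y = "mean")
```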
Cluster 2 exhibits higher mean values for each feature compared to the other clusters. In contrast, the means in cluster 3 are generally close to 0, while cluster 5 shows negative means. Given that the features are scaled using z-normalization, we can conclude that cluster 2 represents YouTubers who are highly famous, with above-average income, subscribers, and video views. Cluster 3 represents average YouTubers, while cluster 5 represents less popular or non-famous ones.
Clustering is an unsupervised learning technique used to group similar data points together based on their inherent similarities. Its purpose is to unveil natural patterns or groupings within the data without relying on a predefined target variable.
Hierarchical clustering proves to be particularly useful when the optimal number of clusters is unknown, allowing for an exploration of clustering behavior based on different distance metrics and linkage methods. The choice of distance and linkage depends on the characteristics of the data and the objectives of the analysis. In contrast, K-means clustering requires the user to specify the desired number of clusters beforehand.
Our clustering results exhibit a significant difference between the optimal choice of k = 10 from hierarchical clustering and k = 2 from K-means clustering. This gap reflects a trade-off between complexity and interpretability. Higher values of k yield more granular clusters, but the interpretation and insights become harder to discern. Conversely, a smaller k, such as k = 2, is easier to interpret but may oversimplify the underlying patterns, potentially losing valuable information.